{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Document Processing with AutoML and Vision API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Statement\n",
    "Formally the brief for this Open Project could be stated as follows: Given a collection of varying pdf/png documents containing similar information, create a pipeline that will extract relevant entities from the documents and store the entities in a standardized, easily accessible format. \n",
    "\n",
    "The data for this project is contained in the Cloud Storage bucket [gs://document-processing/patent_dataset.zip](https://storage.googleapis.com/document-processing/patent_dataset.zip). The file [gs://document-processing/ground_truth.csv](https://storage.googleapis.com/document-processing/ground_truth.csv) contains hand-labeled fields extracted from the patents. \n",
    "The labels in the ground_truth.csv file are filename, category, publication_date, classification_1, classification_2, application_number, filing_date, priority, representative, applicant, inventor, titleFL, titleSL, abstractFL, and publication_number\n",
    "\n",
    "Here is an example of two different patent formats:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table><tr>\n",
    "<td> <img src=\"eu_patent.png\" alt=\"Drawing\" style=\"width: 600px;\"/> </td>\n",
    "<td> <img src=\"us_patent.png\" alt=\"Drawing\" style=\"width: 600px;\"/> </td>\n",
    "</tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Flexible Solution\n",
    "There are many possible ways to develop a solution to this task which allows students to touch on various functionality and GCP tools that we discuss during the ASL, including the Vision API, AutoML Vision, BigQuery, Tensorflow, Cloud Composer, PubSub. \n",
    "\n",
    "For students more interested in modeling with Tensorflow, they could build a classification model from scratch to regcognize the various type of document formats at hand. Knowing the document format (e.g. US or EU patents as in the example above), relevant entities can then be extracted using the Vision API and some basic regex extactors. It might also be possible to train a [conditional random field in Tensorflow](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf) to learn how to tag and extract relevant entities from text and the given labels, instead of writing regex-based entity extractors for each document class. \n",
    "\n",
    "Students more interested in productionization could work to use Cloud Functions to automate the extraction pipeline. Or incorporate PubSub so that when a new document is uploaded to a specific GCS bucket it is parsed and the entities uploaded to a BigQuery table. \n",
    "\n",
    "Below is a solution outline that uses the Vision API and AutoML, uploading the extracted entities to a table in BigQuery. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install AutoML package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Caution:** Run the following command and **restart the kernel** afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "google-cloud-automl==0.1.2\r\n"
     ]
    }
   ],
   "source": [
    "!pip freeze | grep google-cloud-automl==0.1.2 || pip install --upgrade google-cloud-automl==0.1.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set the correct environment variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following variables should be updated according to your own enviroment:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_ID = \"asl-open-projects\"\n",
    "SERVICE_ACCOUNT = \"entity-extractor\"\n",
    "ZONE = \"us-central1\"\n",
    "AUTOML_MODEL_ID = \"ICN6705037528556716784\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following variables are computed from the one you set above, and should\n",
    "not be modified:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "PWD = os.path.abspath(os.path.curdir)\n",
    "\n",
    "SERVICE_KEY_PATH = os.path.join(PWD, \"{0}.json\".format(SERVICE_ACCOUNT))\n",
    "SERVICE_ACCOUNT_EMAIL=\"{0}@{1}.iam.gserviceaccount.com\".format(SERVICE_ACCOUNT, PROJECT_ID)\n",
    "\n",
    "# Exporting the variables into the environment to make them available to all the subsequent cells\n",
    "os.environ[\"PROJECT_ID\"] = PROJECT_ID\n",
    "os.environ[\"SERVICE_ACCOUNT\"] = SERVICE_ACCOUNT\n",
    "os.environ[\"SERVICE_KEY_PATH\"] = SERVICE_KEY_PATH\n",
    "os.environ[\"SERVICE_ACCOUNT_EMAIL\"] = SERVICE_ACCOUNT_EMAIL\n",
    "os.environ[\"ZONE\"] = ZONE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Switching the right project and zone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\n",
      "Updated property [compute/region].\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud config set project $PROJECT_ID\n",
    "gcloud config set compute/region $ZONE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a service account"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               entity-extractor@asl-open-projects.iam.gserviceaccount.com\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud iam service-accounts list | grep $SERVICE_ACCOUNT ||\n",
    "gcloud iam service-accounts create $SERVICE_ACCOUNT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grant service account project ownership "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: We should ideally restrict the permissions to AutoML and Vision roles only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "bindings:\n",
      "- members:\n",
      "  - serviceAccount:automl-vision@asl-open-projects.iam.gserviceaccount.com\n",
      "  - user:dherin@google.com\n",
      "  - user:munn@google.com\n",
      "  role: roles/automl.admin\n",
      "- members:\n",
      "  - serviceAccount:automl-vision@asl-open-projects.iam.gserviceaccount.com\n",
      "  role: roles/automl.editor\n",
      "- members:\n",
      "  - serviceAccount:service-171999062104@gcp-sa-automl.iam.gserviceaccount.com\n",
      "  role: roles/automl.serviceAgent\n",
      "- members:\n",
      "  - serviceAccount:custom-vision@appspot.gserviceaccount.com\n",
      "  role: roles/ml.admin\n",
      "- members:\n",
      "  - serviceAccount:entity-extractor@asl-open-projects.iam.gserviceaccount.com\n",
      "  - user:dherin@google.com\n",
      "  - user:munn@google.com\n",
      "  role: roles/owner\n",
      "- members:\n",
      "  - serviceAccount:custom-vision@appspot.gserviceaccount.com\n",
      "  role: roles/serviceusage.serviceUsageAdmin\n",
      "- members:\n",
      "  - serviceAccount:service-171999062104@sourcerepo-service-accounts.iam.gserviceaccount.com\n",
      "  role: roles/sourcerepo.serviceAgent\n",
      "- members:\n",
      "  - serviceAccount:custom-vision@appspot.gserviceaccount.com\n",
      "  role: roles/storage.admin\n",
      "etag: BwV_ubbP9N0=\n",
      "version: 1\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud projects add-iam-policy-binding $PROJECT_ID \\\n",
    " --member \"serviceAccount:$SERVICE_ACCOUNT_EMAIL\" \\\n",
    " --role \"roles/owner\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create service account keys if not existing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Service key: /content/datalab/clones/document-processing/notebooks/entity-extractor.json\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "test -f $SERVICE_KEY_PATH || \n",
    "gcloud iam service-accounts keys create $SERVICE_KEY_PATH \\\n",
    "  --iam-account $SERVICE_ACCOUNT_EMAIL\n",
    "\n",
    "echo \"Service key: $(ls $SERVICE_KEY_PATH)\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make the key available to google clients for authentication"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = SERVICE_KEY_PATH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implement a document classifier with AutoML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a simple wrapper around an already trained AutoML model trained directly\n",
    "from the cloud console on the various document types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import automl_v1beta1 as automl\n",
    "\n",
    "class DocumentClassifier:\n",
    "  \n",
    "  def __init__(self, project_id, model_id, zone):\n",
    "    self.client = automl.PredictionServiceClient()\n",
    "    self.model = self.client.model_path(project_id, zone, model_id)\n",
    "\n",
    "  def __call__(self, filename):\n",
    "    with open(filename, 'rb') as fp:\n",
    "        image = fp.read()\n",
    "    payload = {\n",
    "      'image': {\n",
    "        'image_bytes': image\n",
    "      }\n",
    "    }\n",
    "    response = self.client.predict(self.model, payload)\n",
    "    predicted_class = response.payload[0].display_name\n",
    "    return predicted_class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how to use that `DocumentClassifier`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "EU patent inferred label: eu\n",
      "US patent inferred label: us\n"
     ]
    }
   ],
   "source": [
    "classifier = DocumentClassifier(PROJECT_ID, AUTOML_MODEL_ID, ZONE)\n",
    "\n",
    "eu_image_label = classifier(\"./eu_patent.png\")\n",
    "us_image_label = classifier(\"./us_patent.png\")\n",
    "\n",
    "print(\"EU patent inferred label:\", eu_image_label)\n",
    "print(\"US patent inferred label:\", us_image_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implement a document parser with Vision API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Documentation:\n",
    "* https://cloud.google.com/vision/docs/base64\n",
    "* https://stackoverflow.com/questions/49918950/response-400-from-google-vision-api-ocr-with-a-base64-string-of-specified-image\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a simple class wrapping calls to the OCR capabilities of Cloud Vision:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "google-api-python-client==1.7.7\r\n"
     ]
    }
   ],
   "source": [
    "!pip freeze | grep google-api-python-client==1.7.7 || pip install --upgrade google-api-python-client==1.7.7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "import base64\n",
    "from googleapiclient.discovery import build\n",
    "\n",
    "class DocumentParser:\n",
    "  def __init__(self):\n",
    "    self.client = build('vision', 'v1')\n",
    "    \n",
    "  def __call__(self, filename):\n",
    "    with open(filename, 'rb') as fp:\n",
    "        image = fp.read()   \n",
    "    encoded_image = base64.b64encode(image).decode('UTF-8')\n",
    "    payload = {\n",
    "        'requests': [{\n",
    "                'image': {\n",
    "                    'content': encoded_image\n",
    "                },\n",
    "                'features': [{\n",
    "                    'type': 'TEXT_DETECTION',\n",
    "                }]\n",
    "            }],\n",
    "        }\n",
    "    request = self.client.images().annotate(body=payload)\n",
    "    response = request.execute(num_retries=3)\n",
    "    return response['responses'][0]['textAnnotations'][0]['description']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now see how to use our `DocumentParser`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(19)\n",
      "Europäisches\n",
      "Patentamt\n",
      "0\n",
      "European\n",
      "Patent Office\n",
      "Office européen\n",
      "des brevets\n",
      "EP 3 399 652 A1\n",
      "(12)\n",
      "EUROPEAN PATENT APPLICATION\n",
      "(43) Date of publication:\n",
      "(51) Int Cl.:\n",
      "07.11.2018 Bulletin 2018/45\n",
      "H04B 117107 (2011.01)H04B 117073 (2011.01)\n",
      "(21) Application number: 18180150.7\n",
      "(22) Date of filing: 24.06.2009\n",
      "(84) Designated Contracting States\n",
      ". YEE, Nathan, D.\n",
      "AT BE BG CH CY CZ DE DK EE ES FI FR GB GR\n",
      "HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL\n",
      "PT RO SE SI SK TR\n",
      "San Diego, CA 92121-1714 (US)\n",
      "SUBRAHMANYA, Parvathanathaln\n",
      "San Diego, CA 92121-1714 (US)\n",
      "(30) Priority: 25.06.2008 US 146232\n",
      "(74) Representative: Wegner, Hans\n",
      "Bardehle Pagenberg Partnerschaft mbB\n",
      "Patentanwälte, Rechtsanwälte\n",
      "Prinzregentenplatz 7\n",
      "81675 München (DE)\n",
      "(62) Document number(s) of the earlier application(s) in\n",
      "accordance with Art. 76 EPC:\n",
      "09770969.5/2311 196\n",
      "(71) Applicant: QUALCOMM Incorporated\n",
      "San Diego, CA 92121-1714 (US)\n",
      "Remarks:\n",
      "This application was filed on 27-06-2018 as a\n",
      "divisional application to the application mentioned\n",
      "under INID code 62\n",
      "(72) Inventors:\n",
      "SHEN, Qiang\n",
      "San Diego, CA 92121-1714 (US)\n",
      "(54)\n",
      "METHODS AND APPARATUS FOR COMMON CHANNEL CANCELLATION IN WIRELESS\n",
      "COMMUNICATIONS\n",
      "(57) A mobile station that is configured to perform\n",
      "common channel cancellation may include a parameter\n",
      "estimation unit that is configured to estimate parameters\n",
      "for generating a common channel error. The mobile sta\n",
      "tion may also include a common channel generation unit\n",
      "that is configured to generate the common channel error\n",
      "based on the parameters. The mobile station may also\n",
      "include an adder that is configured to subtract the com\n",
      "mon channel error from received data samples\n",
      "700\n",
      "702\n",
      "Estimate parameters for generating a common channel error\n",
      "704\n",
      "Generate the common channel error based on the parameters\n",
      "706\n",
      "Subtract the common channel error from received data samples\n",
      "FIG. 7\n",
      "Printed by Jouve, 75001 PARIS (FR)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "parser = DocumentParser()\n",
    "\n",
    "eu_patent_text = parser(\"./eu_patent.png\")\n",
    "us_patent_text = parser(\"./us_patent.png\")\n",
    "\n",
    "print(eu_patent_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implement the rule-based extractors for each document categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each patent type, we now want to write a simple function that takes the \n",
    "text extracted by the OCR system above and extract the name and date of the patent.\n",
    "\n",
    "We will write two rule base extractors, one for each type of patent (us or eu), each of\n",
    "which will yield a PatentInfo object collecting the extracted object into a `nametuple` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import namedtuple\n",
    "\n",
    "PatentInfo = namedtuple('PatentInfo', ['filename', 'category', 'date', 'number'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are two helper functions for text splitting and pattern matching:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pandas==0.23.4\r\n"
     ]
    }
   ],
   "source": [
    "!pip freeze | grep pandas==0.23.4 || pip install --upgrade pandas==0.23.4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "\n",
    "def split_text_into_lines(text, sep=\"\\(..\\)\"):\n",
    "  lines = [line.strip() for line in re.split(sep, text)]\n",
    "  return lines\n",
    "\n",
    "\n",
    "def extract_pattern_from_lines(lines, pattern):\n",
    "  \"\"\"Extracts the first line from `text` with a matching `pattern`.\n",
    "  \"\"\"\n",
    "  lines = pd.Series(lines)\n",
    "  mask = lines.str.contains(pattern)\n",
    "  return lines[mask].values[0] if mask.any() else None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### European patent extractor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_info_from_eu_patent(filename, text):\n",
    "  lines = split_text_into_lines(text)\n",
    "\n",
    "  category = \"eu\"\n",
    "  \n",
    "  number_paragraph = extract_pattern_from_lines(lines, \"EP\")\n",
    "  number_lines = number_paragraph.split('\\n')\n",
    "  number = extract_pattern_from_lines(number_lines, 'EP')  \n",
    "  \n",
    "  date_paragraph = extract_pattern_from_lines(lines, 'Date of filing:')\n",
    "  date = date_paragraph.replace(\"Date of filing:\", \"\").strip()\n",
    "  \n",
    "  return PatentInfo(\n",
    "    filename=filename,\n",
    "    category=category,\n",
    "    date=date,\n",
    "    number=number\n",
    "  )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PatentInfo(filename='./eu_patent.png', category='eu', date='24.06.2009', number='EP 3 399 652 A1')"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eu_patent_info = extract_info_from_eu_patent(\"./eu_patent.png\", eu_patent_text)\n",
    "eu_patent_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### US patent extractor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_info_from_us_patent(filename, text):\n",
    "  lines = split_text_into_lines(text)\n",
    "\n",
    "  category = \"us\"\n",
    "  \n",
    "  number_paragraph = extract_pattern_from_lines(lines, \"Patent No.:\")\n",
    "  number = number_paragraph.replace(\"Patent No.:\", \"\").strip()\n",
    "  \n",
    "  date_paragraph = extract_pattern_from_lines(lines, \"Date of Patent:\")\n",
    "  date = date_paragraph.split('\\n')[-1]\n",
    "  \n",
    "  return PatentInfo(\n",
    "    filename=filename,\n",
    "    category=category,\n",
    "    date=date,\n",
    "    number=number\n",
    "  )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PatentInfo(filename='./us_patent.png', category='us', date='Nov. 27, 2018', number='US 10,142,913 B2')"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "us_patent_info = extract_info_from_us_patent(\"./us_patent.png\", us_patent_text)\n",
    "us_patent_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tie all together into a DocumentExtractor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DocumentExtractor:\n",
    "  def __init__(self, classifier, parser):\n",
    "    self.classifier = classifier\n",
    "    self.parser = parser\n",
    "\n",
    "  def __call__(self, filename):\n",
    "\n",
    "    text = self.parser(filename)\n",
    "    label = self.classifier(filename)\n",
    "    \n",
    "    if label == 'eu':\n",
    "      info = extract_info_from_eu_patent(filename, text)\n",
    "    elif label == 'us':\n",
    "      info = extract_info_from_us_patent(filename, text)\n",
    "    else:\n",
    "      raise ValueError\n",
    "        \n",
    "    return info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PatentInfo(filename='./eu_patent.png', category='eu', date='24.06.2009', number='EP 3 399 652 A1')\n",
      "PatentInfo(filename='./us_patent.png', category='us', date='Nov. 27, 2018', number='US 10,142,913 B2')\n"
     ]
    }
   ],
   "source": [
    "extractor = DocumentExtractor(classifier, parser)\n",
    "\n",
    "eu_patent_info = extractor(\"./eu_patent.png\")\n",
    "us_patent_info = extractor(\"./us_patent.png\")\n",
    "\n",
    "print(eu_patent_info)\n",
    "print(us_patent_info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload found entites to BigQuery\n",
    "Start by adding a dataset called patents to the current project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "google-cloud-bigquery==1.8.1\r\n"
     ]
    }
   ],
   "source": [
    "!pip freeze | grep google-cloud-bigquery==1.8.1 || pip install google-cloud-bigquery==1.8.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check to see if the dataset called \"patents\" exists in the current project. If not, create it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The dataset \"patents\" already exists in project asl-open-projects.\n"
     ]
    }
   ],
   "source": [
    "from google.cloud import bigquery\n",
    "\n",
    "\n",
    "client = bigquery.Client()\n",
    "\n",
    "# Collect datasets and project information\n",
    "datasets = list(client.list_datasets())\n",
    "project = client.project\n",
    "\n",
    "# Create a list of the datasets. If the 'patents' dataset \n",
    "# does not exist, then create it.\n",
    "if datasets:\n",
    "    all_datasets = []\n",
    "    for dataset in datasets:\n",
    "        all_datasets.append(dataset.dataset_id)\n",
    "else:\n",
    "    print('{} project does not contain any datasets.'.format(project))\n",
    "\n",
    "if datasets and 'patents' in all_datasets:\n",
    "    print('The dataset \"patents\" already exists in project {}.'.format(project))\n",
    "else:\n",
    "    dataset_id = 'patents'\n",
    "    dataset_ref = client.dataset(dataset_id)\n",
    "\n",
    "    # Construct a Dataset object.\n",
    "    dataset = bigquery.Dataset(dataset_ref)\n",
    "\n",
    "    # Specify the geographic location where the dataset should reside.\n",
    "    dataset.location = \"US\"\n",
    "\n",
    "    # Send the dataset to the API for creation.\n",
    "    dataset = client.create_dataset(dataset)  # API request\n",
    "    print('The dataset \"patents\" was created in project {}.'.format(project))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Upload the extracted entities to a table called \"found_entities\" in the \"patents\" dataset.\n",
    "\n",
    "Start by creating an empty table in the patents dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an empty table in the patents dataset and define schema\n",
    "dataset_ref = client.dataset('patents')\n",
    "\n",
    "schema = [\n",
    "    bigquery.SchemaField('filename', 'STRING', mode='NULLABLE'),\n",
    "    bigquery.SchemaField('category', 'STRING', mode='NULLABLE'),\n",
    "    bigquery.SchemaField('date', 'STRING', mode='NULLABLE'),\n",
    "    bigquery.SchemaField('number', 'STRING', mode='NULLABLE'),\n",
    "]\n",
    "table_ref = dataset_ref.table('found_entities')\n",
    "table = bigquery.Table(table_ref, schema=schema)\n",
    "table = client.create_table(table)  # API request\n",
    "\n",
    "assert table.table_id == 'found_entities'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "def upload_to_bq(patent_info, dataset_id, table_id):\n",
    "    \"\"\"Appends the information extracted in patent_info into the \n",
    "    dataset_id:table_id in BigQuery.\n",
    "    patent_info should be a namedtuple as created above and should\n",
    "    have components matching the schema set up for the table\n",
    "    \"\"\"\n",
    "    table_ref = client.dataset(dataset_id).table(table_id)\n",
    "    table = client.get_table(table_ref)  # API request\n",
    "\n",
    "    rows_to_insert = [tuple(patent_info._asdict().values())]\n",
    "\n",
    "    errors = client.insert_rows(table, rows_to_insert)  # API request\n",
    "\n",
    "    assert errors == []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "upload_to_bq(eu_patent_info, 'patents', 'found_entities')\n",
    "upload_to_bq(us_patent_info, 'patents', 'found_entities')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examine the resuts in BigQuery. \n",
    "We can now query the BigQuery table to see what values have been uploaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>category</th>\n",
       "      <th>date</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>./us_patent.png</td>\n",
       "      <td>us</td>\n",
       "      <td>Nov. 27, 2018</td>\n",
       "      <td>US 10,142,913 B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>./eu_patent.png</td>\n",
       "      <td>eu</td>\n",
       "      <td>24.06.2009</td>\n",
       "      <td>EP 3 399 652 A1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>./eu_patent.png</td>\n",
       "      <td>eu</td>\n",
       "      <td>24.06.2009</td>\n",
       "      <td>EP 3 399 652 A1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>./us_patent.png</td>\n",
       "      <td>us</td>\n",
       "      <td>Nov. 27, 2018</td>\n",
       "      <td>US 10,142,913 B2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          filename category           date            number\n",
       "0  ./us_patent.png       us  Nov. 27, 2018  US 10,142,913 B2\n",
       "1  ./eu_patent.png       eu     24.06.2009   EP 3 399 652 A1\n",
       "2  ./eu_patent.png       eu     24.06.2009   EP 3 399 652 A1\n",
       "3  ./us_patent.png       us  Nov. 27, 2018  US 10,142,913 B2"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    *\n",
    "FROM `asl-open-projects.patents.found_entities`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also look at the resulting entities in a dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>category</th>\n",
       "      <th>date</th>\n",
       "      <th>number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>./eu_patent.png</td>\n",
       "      <td>eu</td>\n",
       "      <td>24.06.2009</td>\n",
       "      <td>EP 3 399 652 A1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>./us_patent.png</td>\n",
       "      <td>us</td>\n",
       "      <td>Nov. 27, 2018</td>\n",
       "      <td>US 10,142,913 B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>./us_patent.png</td>\n",
       "      <td>us</td>\n",
       "      <td>Nov. 27, 2018</td>\n",
       "      <td>US 10,142,913 B2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>./eu_patent.png</td>\n",
       "      <td>eu</td>\n",
       "      <td>24.06.2009</td>\n",
       "      <td>EP 3 399 652 A1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          filename category           date            number\n",
       "0  ./eu_patent.png       eu     24.06.2009   EP 3 399 652 A1\n",
       "1  ./us_patent.png       us  Nov. 27, 2018  US 10,142,913 B2\n",
       "2  ./us_patent.png       us  Nov. 27, 2018  US 10,142,913 B2\n",
       "3  ./eu_patent.png       eu     24.06.2009   EP 3 399 652 A1"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_id = 'patents'\n",
    "table_id = 'found_entities'\n",
    "\n",
    "sql = \"\"\"\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `{}.{}.{}`\n",
    "LIMIT 10\n",
    "\"\"\".format(project, dataset_id, table_id)\n",
    "\n",
    "df = client.query(sql).to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: We should include some section on how to evaluate the performance of the extractor. Here we can use the ground_truth table and explore different kinds of string metrics (e.g. Levenshtein distance) to measure accuracy of the entity extraction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To remove the table \"found_entities\" from the \"patents\" dataset created above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "These tables were found in the patents dataset: ['found_entities', 'ground_truth']\n",
      "Table found_entities was deleted from dataset patents.\n"
     ]
    }
   ],
   "source": [
    "dataset_id = 'patents'\n",
    "table_id = 'found_entities'\n",
    "\n",
    "tables = list(client.list_tables(dataset_id))  # API request(s)\n",
    "\n",
    "if tables:\n",
    "    num_tables = len(tables)\n",
    "    all_tables = []\n",
    "    for _ in range(num_tables):\n",
    "        all_tables.append(tables[_].table_id)\n",
    "    print('These tables were found in the {} dataset: {}'.format(dataset_id,all_tables))\n",
    "    if table_id in all_tables:\n",
    "        table_ref = client.dataset(dataset_id).table(table_id)\n",
    "        client.delete_table(table_ref)  # API request\n",
    "        print('Table {} was deleted from dataset {}.'.format(table_id, dataset_id))\n",
    "else:\n",
    "    print('{} dataset does not contain any tables.'.format(dataset_id))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The next cells will remove the patents dataset and all of its tables. Not recommended as I recently uploaded a talbe of 'ground_truth' for entities in the files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To remove the \"patents\" dataset and all of its tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nclient = bigquery.Client()\\n# Collect datasets and project information\\ndatasets = list(client.list_datasets())\\nproject = client.project\\n\\nif datasets:\\n    all_datasets = []\\n    for dataset in datasets:\\n        all_datasets.append(dataset.dataset_id)\\n    if \\'patents\\' in all_datasets:\\n        # Delete the dataset \"patents\" and its contents\\n        dataset_id = \\'patents\\'\\n        dataset_ref = client.dataset(dataset_id)\\n        client.delete_dataset(dataset_ref, delete_contents=True)\\n        print(\\'Dataset {} deleted from project {}.\\'.format(dataset_id, project))\\n    else: print(\\'{} project does not contain the \"patents\" datasets.\\'.format(project))\\nelse:\\n    print(\\'{} project does not contain any datasets.\\'.format(project))\\n'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'''\n",
    "client = bigquery.Client()\n",
    "# Collect datasets and project information\n",
    "datasets = list(client.list_datasets())\n",
    "project = client.project\n",
    "\n",
    "if datasets:\n",
    "    all_datasets = []\n",
    "    for dataset in datasets:\n",
    "        all_datasets.append(dataset.dataset_id)\n",
    "    if 'patents' in all_datasets:\n",
    "        # Delete the dataset \"patents\" and its contents\n",
    "        dataset_id = 'patents'\n",
    "        dataset_ref = client.dataset(dataset_id)\n",
    "        client.delete_dataset(dataset_ref, delete_contents=True)\n",
    "        print('Dataset {} deleted from project {}.'.format(dataset_id, project))\n",
    "    else: print('{} project does not contain the \"patents\" datasets.'.format(project))\n",
    "else:\n",
    "    print('{} project does not contain any datasets.'.format(project))\n",
    "'''"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
